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Abstract 

Background: The banana family (Musaceae) includes genetically a diverse group of species and their diploid and polyploid 
hybrids that are widely cultivated in the tropics. In spite of their socio-economic importance, the knowledge of Musaceae 
genomes is basically limited to draft genome assemblies of two species, Musa acuminata and M. balbisiana. Here we aimed 
to complement this information by analyzing repetitive genome fractions of six species selected to represent various 
phylogenetic groups within the family. 

Results: Low-pass sequencing of M. acuminata, M. ornata, M. textilis, M. beccarii, M. balbisiana, and Ensete gilletii genomes 
was performed using a 454/Roche platform. Sequence reads were subjected to analysis of their overall intra- and inter- 
specific similarities and, all major repeat families were quantified using graph-based clustering. Maximus/SIRE and Angela 
lineages of Tyl/copia long terminal repeat (LTR) retrotransposons and the chromovirus lineage of Ty3/gypsy elements were 
found to make up most of highly repetitive DNA in all species (14-34.5% of the genome). However, there were quantitative 
differences and sequence variations detected for classified repeat families as well as for the bulk of total repetitive DNA. 
These differences were most pronounced between species from different taxonomic sections of the Musaceae family, 
whereas pairs of closely related species {M. acuminata/M. ornata and M. beccarii/M. textilis) shared similar populations of 
repetitive elements. 

Conclusions: This study provided the first insight into the composition and sequence variation of repetitive parts of 
Musaceae genomes. It allowed identification of repetitive sequences specific for a single species or a group of species that 
can be utilized as molecular markers in breeding programs and generated computational resources that will be 
instrumental in repeat masking and annotation in future genome assembly projects. 
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Introduction 

Bananas are giant perennial herbs belonging to the genus Musa, 
which are grown in tropical and subtropical regions. Edible sweet 
and starchy banana cultivars provide a staple food for many 
millions of people and are a major export commodity. Based on a 
set of morphological descriptors and basic chromosome number, 
the genus Musa has been traditionally subdivided into four 
sections: Eumusn (x=ll), Rhodochlamys (x=ll), Australimusa 
(x = 10), and Callimusa (x = 9 or 10) [1] . However, this classification 
has been often questioned. The recent use of a variety of molecular 
markers provided detailed information on Musa genetic diversity 
and phylogenesis [2-8]. Finally, in 2013, the Australimusa and 
Callimusa sections were merged into the section Callimusa, and 
sections Eumusa and Rhodochlamys were merged into the section 
Musa [9]. 



Most of the edible bananas are vegetatively propagated diploid 
and polyploid forms of M. acuminata (A genome, 2n — 2x — 22) and 
hybrids that originated from crosses between M. acuminata and M. 
balbisiana (B genome, 2n = 2x = 22) [10] belonging to the section 
Musa. Although there is some inconsistency in the classification of 
cultivated banana clones, it has been estimated that there are at 
least 1000 different cultivars grown worldwide [11]. Another 
group of edible cultivars, called Fei, represents a separate line of 
banana evolution and comprises a different species complex 
classified in the section Callimusa. The section is represented by a 
number of wild Musa species, including M. maclayi, M. peekelii, and 
M. lolodensis, the most probable progenitors of edible Fei bananas 
[12-15] and M. textilis (T genome). Fei bananas are partheno- 
carpic and vegetatively propagated like other edible banana 
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clones. However, they were domesticated independendy from the 
cultivars of the section Musa. 

The production of bananas has been seriously threatened by the 
increasing range of fungal, viral, and insect diseases. At the same 
time, breeding of improved cultivars is hampered by seed sterility, 
unknown origin of the cultivated clones, and poor knowledge on 
genetic diversity of the genus Musa. The nuclear genome of Musa 
species is relatively small (1C~600 Mbp; [16,17]), and previous 
studies showed that ~55% of the genome is represented by 
repetitive DNA, especially different types of retroelements [18- 
20] . Genomic repeats evolve more rapidly than coding sequences, 
and plant geneticists and breeders found them a unique source of 
molecular markers to map important genes, analyze genetic 
diversity, and study processes of speciation and genome evolution 
[21-23]. 

Global characterization of complex populations of plant 
genomic repeats recently has been made feasible by combining 
next-generation sequencing technologies with newly developed 
bioinformatic tools [24,25]. This approach also led to the first 
characterization of major repeat types in the genome of 'Calcutta', 
a clone of M. acuminata ssp. burmannica, a popular male parent in a 
number of breeding programs [19]. Additional information about 
repeat composition and organization in this species (ssp. malaccensis) 
was obtained after producing a draft sequence of a double haploid 
individual derived from wild accession 'Pahang' [20]. Finally, 
initial data about repeats in another species, M. balbisiana (variety 
'Pisang Klutuk Wulung') recendy have been released along with its 
draft genome [26]. Apart from M. acuminata and M. balbisiana from 
the section Musa, similar information is missing for the section 
Callimusa and other representatives of the Musaceae family. 

In this study, we employed bioinformatic analysis of low-pass 
genome sequencing data to get deep insight into repeat 
composition of Musaceae genomes. We selected five Musa species 
and one representative of the genus Ensete for comparative analysis 
of repetitive fractions of their genomes in order to (1) identify and 
quantify major groups of repetitive sequences, (2) assess sequence 
diversity of repeats between the species and investigate its 
correlation with the phylogeny of the Musaceae family, and (3) 
generate bioinformatic resources for development of repeat-based 
genome-specific markers and for repeat identification and 
annotation for future genome assembly projects. 

Results 

Low-pass genome sequencing, estimation of repeat 
proportions, and similarities between the species 

Five representatives of the genus Musa and one Ensete species 
were selected for analysis to cover various phylogenetic clades of 
the Musaceae family (Fig. 1). They possess relatively small 
genomes with only moderate differences between the species, 
ranging from 567 to 763 Mbp/ 1C [16,17,27]. Whole-genome 
shotgun sequencing was performed using Roche/454 technology, 
and resulting reads were trimmed to the same length of 200 
nucleotides. The same amount of reads (380,599) from each 
species was used for analysis, providing 0.10-0.13 x genome 
coverage (Table 1). For this coverage, the probability of detecting 
repetitive sequences with 10 and 100 copies per haploid genome 
was 63-74% and >99.9%, respectively [28]. 

Sequence reads derived from genomic repeats were identified 
and quantified based on the number of similarity hits generated in 
all-to-all read comparisons. In principle, low sequencing coverage 
and similarity threshold used in this analysis (90% identity over 
55% of the sequence length) provide a small chance of detecting 
hits between single-copy genomic sequences. Thus, most of the 
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Figure 1. Evolutionary relationship between species of Musa- 
ceae family. Phylogeny estimated from ITS data using BioNJ. Six 
genomes selected for repeat analysis are highlighted. 
doi:10.1371/journal.pone.0098918.g001 

similarities are expected between the reads representing repetitive 
sequences and, their frequencies are proportional to copy number 
of corresponding repetitive elements in the genome. In the first 
part of the analysis, similarity hits were investigated separately for 
reads from each species compared to themselves, thus providing 
information about repeat proportions in individual genomes. 
There were similar amounts (55-60%) of reads generating at least 
one similarity hit in all species. However, there were differences in 
proportions of moderately (> 1 00 copies/ 1C) and especially of 
high-copy (> 1 000 copies/ 1C) repeats that were most abundant in 
M. beccarii, M. acuminata, and M. ornata (Fig. 2A). Higher 
proportions of high-copy repeats in these species were also evident 
from differences in total numbers of similarity hits (Fig. 2B). 

To evaluate overall differences in sequence composition and 
abundance between pairs of species, inter-specific comparisons of 
read similarities were performed and visualized in a form of scatter 
plots, where dots represented reads and their positions were 
determined by numbers of similarity hits in both species (Fig. 3). 
This analysis revealed remarkable similarity of repeat composition 
of M. acuminata and M. ornata genomes, resulting in a diagonal 
pattern of the dot positions, which was due to similar abundance of 
corresponding sequences in both genomes (Fig. 30). High 
similarity was also revealed between repeats from M. beccarii and 
M. textilis, except for much higher genomic proportion of 45S 
rDNA repeats in M. beccarii (Fig. 31). Quantification of rDNA 
proportions revealed that its differential amplification accounts for 
most of the genome size difference between these species. The 
detected quantitative differences in 45 S rDNA content are also in 
agreement with previously published FISH experiments which 
have shown higher number of 45S rDNA sites in M. beccarii and E. 
gilletii [17]. On the other hand, repeat composition of is. gilletii was 
the most divergent from other species (Fig. 3A-E). The diagram in 
Fig. 3P summarizing all similarities demonstrates this divergence 
of E. gilletii from all other species and also shows a closer 
relationship of M. balbisiana to the M. acuminata! M. ornata group. 
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Classification and comparative analysis of major groups 
of repeats 

To classify repetitive sequences and identify their orthologous 
groups present in individual genomes, the identified intra- and 
inter-specific similarities of sequence reads were analyzed using the 
RepeatExplorer pipeline [25]. The pipeline runs graph-based 
clustering algorithm [24] to identify groups of frequently over- 
lapping reads representing families of repetitive elements followed 
by similarity- and structure-based repeat identification tools that 
aid in repeat classification. Classification of repeats in the 106 
largest clusters exceeding in size 0.03% of the analyzed reads 
revealed that Tyl /copia, Ty3/ gypsy, and rDNA repeats make up the 
majority of highly and moderately repeated sequences in the 
Musaceae genomes (Table 2, Fig. SI). Tyl /copia LTR-retro- 
transposons, mainly represented by Maximus and Angela lineages, 
were 2-4-fold more abundant than Ty3 /gypsy. However, propor- 
tions of most repeat groups varied between the species even for the 
most abundant elements like Tyl I copia Maximus, which was much 
less abundant in E. gilktii and showed almost 1.5-fold variation in 
abundance between Musa species. Depending on the genome size 
and repeat content, the annotated repeats corresponded to 24% 
(E. gilktii) up to 44% (M. beccarii) of the genome. The rest of the 
repeats including mainly low-copy sequences forming small repeat 
clusters made up 26-43% (Table 2). In summary, repetitive 
sequences occupy about 66-71% in all genomes. 

An inherent feature of the clustering analysis when applied to 
sequence data from multiple species is that orthologous repeat 
families from different species are grouped to the same clusters. 
This facilitates identification and quantification of repeats that are 
shared between the species as well as detection of species-specific 
sequences. The analysis revealed that a large part of the repeat 
clusters representing various families of LTR-retrotransposons, 
DNA transposons, long interspersed nuclear elements (LINE), and 
rDNA was shared by all Musa species and, to a smaller extent, also 
with E. gilktii (Fig. 4A, group 4). This group of clusters was also the 
most significant in terms of genome proportions, containing about 
32% of analyzed reads (89% of annotated repeats). Smaller groups 
3, 5, and 7 included clusters that were shared by two or three 
species. They comprised mostiy non-coding parts of LTR- 
retrotransposons (e.g., LTR sequences), which evolve more rapidly 
than their gag-pol regions. Groups of species sharing these 
sequences included M. beccarii/ M. textilis (group 3), M. acuminata/ 
M. omata (group 5), and M. acuminata/ M. omata/M. balbisiana 
(group 7), which is in agreement with overall read similarities 
between the species presented in Fig. 3. The largest number of 
species-specific clusters, including complete retrotransposon se- 
quences, was detected for E. gilktii, which was in agreement with 
the phylogenetic divergence of this species. 

To verify some of the differences in repeat composition revealed 
by the clustering analysis, three putative section or species-specific 
repeats were detected in a set of Musaceae species using Southern 
blot hybridization (Fig. 4C-E). Experimental results were in all 
cases in agreement with the output of bioinformatic analysis. The 
probe derived from CL16, classified as a Reina lineage of Ty3 1 
gypsy elements specific for the genome of E. gilktii, produced strong 
hybridization signals in species of the Ensete genus, with weak or no 
labeling of genomic DNAs of Musa species (Fig. 4C). An M. beccarii- 
specific tandem repeat found in the cluster CL51 also showed 
predominantiy species-specific hybridization pattern with only 
minor signals in related species from Callimusa sections (Fig. 4D). 
The probe derived from tandem-like repeat CL30 present in 
sequence reads from M. acuminata, M. omata, and M. balbisiana was 
confirmed to be specific for section Musa, which includes these 
three species (Fig. 4E). 
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Figure 2. All-to-all similarity comparison of sequence reads from six Musaceae species. (A) The barplot shows the total number of reads 
with detected similarity hits which is proportional to size of repetitive fraction of the genome. As the number of similarity hits to each read is also 
proportional to its copy number, reads derived from repetitive elements can be divided into low, medium and high copy number fractions. (B) The 
total number of similarity hits that correspond to number of read pairs with similarity. 
doi:10.1371/journal.pone.0098918.g002 



In addition to identifying species- and section-specific repeats 
based on their presence in different clusters, it was possible to 
reveal more subtle sequence variations even for repeats grouped 
into the same cluster. An example of this variability is presented in 
Fig. 5 A— B, showing a graph structure of the cluster CL18 
representing a family of Ty3 /gypsy elements belonging to the 



chromovirus CRM clade. In higher plants, transposition of this 
group of LTR-retrotransposons is targeted to centromeres, and 
this localization has also been reported for M. acuminata CRM 
elements [29]. The cluster graph is composed of nodes represent- 
ing individual sequence reads and edges connecting reads with 
similarities exceeding the specified threshold [24]. Since the node 
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Figure 3. Comparison of genomic abundance of analyzed reads in all six species. (A-O) Scatter plots show pairwise comparisons of all 
analyzed sequences between pairs of species. Each spot corresponds to one sequence read. For each sequence read, the number of similarity hits in 
each species is displayed (this number is proportional to genomic representation of a particular sequence). Red diagonal line marks the position of 
sequences with equiproportional genomic representations. Sequences with differential genomic representation between species deviate from 
diagonal. The 45S rDNA sequences are shown in red. (P) Graph summarizing the number of identified read similarities between and within genomes. 
Width of the lines connecting nodes of the graph correspond to the number of identified similarity hits between sequence reads from different 
species (straight lines) and within the same species (loops). 
doi:10.1371/journal.pone.0098918.g003 



distances are inversely proportional to read similarities, the graph 
layout reflects sequence variability of the element copies in the 
genome, and in the case of reads from different species, reveals its 
inter-specific variability. This variability (Fig. 5B) as well as 
phylogenetic analysis of reverse transcriptase (RT)-coding se- 
quences extracted from the reads (Fig. 5C) were in agreement with 
phylogenetic reltionships of the species (Fig. 1). In addition, the 
RT-based phylogenetic tree provided better discrimination of 
sequences from closely related species and contained some species- 
or section-specific branches with shorter edges, implying recent 
amplification of CRM elements in M. acuminata, M. ornata, and M. 
balbisiana. 

Preparation of repeat databases and their use for repeat 
annotation in assembled genomes 

Sequence databases specific for various types or families of 
repetitive elements were prepared by merging reads from clusters 
with the same annotations. These databases can be used for 
similarity-based repeat identification in assembled sequences, as 



implemented, for example, in our Profrep server (http://w31amc. 
umbr.cas.cz/profrep/public/) [30]. Compared to approaches 
using databases of representative elements or consensus sequences, 
collections of reads gathered from clustering analysis provide 
better sensitivity for detecting divergent repetitive elements 
because they better represent their sequence variability. Examples 
of repeat annotation of randomly selected tracks of M. acuminata 
genome assembly [20] based on similarity hits to repeat-specific 
collections of sequence reads are provided in Fig. 6, S2, and S3. 
Identified repetitive regions (Fig. 6B) were mosdy in agreement 
with the assembly annotation (Fig. 6C). However, it was possible to 
assign specific repeat types to many regions listed as unclassified 
repeats in the assembly annotation (e.g. most Tjl/copia Maximus 
elements in Fig. 6 or Tyl/copia Angela in Fig. S2). An additional 
benefit of our approach is the possibility to visualize abundance of 
corresponding repeats in other Musaceae species. For example, 
Angela elements present in the analyzed region of chromosome 9 
are well conserved in all Musaceae species including E. gilletii, 
whereas other repeats show larger variations in their abundance 
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Figure 4. Comparative analysis of Musaceae species based on the cluster composition. (A) Sequence composition of the largest clusters is 
shown. The size of the rectangle is proportional to the number of reads in a cluster for each species. Bar plot in the top row shows the size of the 
clusters as number of reads. Color of the rectangles correspond to the type of the repeat. Upper lines label groups of clusters discussed in the text. 
The percentage of reads included in the group is shown in parentheses. (B) The presence of mobile element protein domains in the contig assembled 
from sequences within the cluster. Only clusters that were annotated are shown. (C-E) Validation of clustering results by Southern blot. Genomic DNA 
from 15 species was probed with sequences derived from clusters CL16, CL51. and CL30. The lanes contain DNA from 1/M. acuminata ssp. zebrina (ITC 
0728), 21 M. acuminata ssp. malaccensis (ITC 0250), 3M acuminata ssp. burmannicoides (ITC 0249), 4/M. ornata (ITC 0637), SIM. mannii (ITC 141 1), 6/M. 
ornata (ITC 0528), 7/M. balbisiana (ITC 1120), 8/M. balbisiana ('Pisang Klutuk Wulung'), 9/M. balbisiana (ITC 0247), 10/M. peekelii (ITC 0917), 11/M. 
maclayi (ITC 0614), 12/M. textilis (ITC 0539), 13/M beccarii (ITC 1070), 14/E ventricosum (ITC 1387), and 15/F. gilletii (ITC 1389). 
doi:1 0.1 371 /journal. pone.0098918.g004 



(Fig. 6A). On the other hand, our approach failed to detect some 
of the small regions annotated as DNA transposos or unclassified 
repeats in the M. acuminata assembly (Fig. 6B-C), probably due to 
their very small genomic abundance, which resulted in no hits to 
our sequence read databases. 

Discussion 

Identification of repetitive elements using graph-based cluster- 
ing of sequence reads is one of the novel bioinformatic tools 



specifically designed to utilize the power of next generation 
sequencing technologies [24,25]. This approach proved to be 
efficient in global repeat characterization in complex plant 
[24,31,32] and animal [33] genomes and in investigation of 
repeat composition of individual chromosomes [30,34] or their 
compartments [35,36]. This study employs repeat clustering 
methodology for comparative analysis of multiple genomes, 
including species representing two genera of the Musaceae family. 
It extends the previous survey of M. acuminata repeats [19] by 
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Figure 5. Variability of sequences within cluster CL1 8. (A) Sequence reads are represented by nodes of the graph and reads with identity of at 
least 90% with minimal overlap of 1 10 nt are connected by lines. Graph layout was calculated using the 3D version of the Fruchterman and Reingold 
algorithm [55] from which a 2D projection is shown. Reads are colored based on their similarity to conserved coding domains of LTR 
retrotransposons. Reads from LTR regions are colored by light blue. (B) Nodes of the graph are colored based on their species of origin. The six 
identical graphs show reads derived from each species in red; remaining reads are gray to highlight species-specific parts of the graphs. The parts of 
the graphs that represent the most variable sequence regions in CRM CL18 element, which can differentiate between species, are labeled by black 
arrows. (C) Maximum-likelihood tree based on nucleotide alignment of sequences covering the reverse transcriptase protein domain of CRM CL18. 
Sequence read names are colored based on the species of origin. 
doi:10.1371/journal.pone.0098918.g005 



analyzing more sequence data and investigating representatives of 
the genus covering its two taxonomic sections Musa and Callimusa 
[9] and three banana genomes (A, B, and T). It also complements 
information about repeat composition gathered from the current 
genome assemblies of M. acuminata and M. balbisiana [20,26]. 

Although highly and moderately repeated sequences can reach 
up to 80-85% in plants with larger genomes [37,38], their 
proportions in the investigated Musaceae species were smaller (66- 
71%). Such repeat proportions are slightly above repeat content 
estimated in species with relatively small genomes, including Oryza 
(25-66%), Vitis vinifera (41.4%), Sorghum bicolor (61%), Malusxdo- 
mestica (67%), and Nelumbo nucifera (50%) [39-43]. The observed 
dominance of LTR-retrotransposons in the fraction of highly 
repeated sequences is a common feature of higher plant genomes 
where retroelements represent one of the major forces driving 
genome size evolution [38,44,45]. Although individual retro- 
element families varied in genome proportions in the investigated 
Musaceae species, there was no significant correlation of their 
abundance with genome size variation. Thus, genome size 
differences in the studied species cannot be attributed to simple 
amplification of particular repetitive element as seen in some other 
plant genomes [38,39]. Rather, genome size was affected by joint 
activities of more transposable element (TE) lineages. 

The observed sequence variation and quantitative differences 
among individual TE lineages correspond well to previous results 
on phylogeny of the Musaceae family. The highest similarity of 
repeats within the groups of M. beccarii/M. textilis and M. 
acuminata/ M. ornata or M. acuminata/ M. omata/M. balbisiana are in 
agreement with previous results that showed close phylogenetic 
relationships of these species [6,8,46] and support the new 
taxonomy of Musaceae [9] . In general, sequence and quantitative 
differences were proportional to diversification of species of the 
Musaceae family (compare Fig. 1, 4, and 5). This is in agreement 



with previous comparative studies of repeats in genomes of the 
Orobanchaceae family [47] and genus Oryza [39]. Overall, the 
present and our previous studies show that repeat analysis can 
support the analysis of evolutionary relationships. 

Clustering-based repeat analysis employed in this study provides 
a useful alternative to repeat quantification from genome 
assemblies. Comparison of our shotgun data with the published 
assembly of M. acuminata DH-Pahang [20] confirmed that 
repetitive sequences are under-represented in the assembly 
anchored into 1 1 Musa chromosomes, while repeats are over- 
represented in remaining 30% sequences of un-anchored assembly 
(Fig. S4). However, in general, we observed good agreement 
between our annotation of M. acuminata clusters and curated TEs 
in the DH-Pahang assembly. Approximately 90-95 % of elements 
annotated in the genome assembly comply with the annotation 
based on our clustering results. Additionally, estimates of total 
abundance of LTR TEs in DH-Pahang and our estimates for M. 
acuminata are very close in the cases of Tyl/copia and Ty3/ gypsy 
(Table 2). The same Tyl/copia and Ty3 /gypsy lineages were 
identified in genome assembly with exceptions of Tyl I copia Ivana, 
which was newly pinpointed in this study. Graph-based clustering 
allowed us to assign all identified LTR retrotransposons into 
lineages, while in genomic assembly about 46% and 24% of Tyl I 
copia and Ty3 /gypsy, respectively, were not further classified. This is 
one of the benefits of graph-based clustering, as the sequence 
clusters contain both sequences derived from complete autono- 
mous TEs and continuum of incomplete and mutated TEs copies, 
which could be difficult to detect when only similarity search 
against database of known repeats is used. On the other hand, the 
amount of DNA transposons and LINEs estimated in the present 
work is much lower compared with the estimates based on the 
genome assembly. To explain this difference, we have compared 
the annotation of the DH-Pahang assembly with all our M. 
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Figure 6. Analysis of 100 kb genomic region of the DH-Pahang assembly using Profrep. Genomic sequence Chr9:20,1 50,000-20,250,000 
together with repeat annotation was obtained from the Banana Genome Hub (http://banana-genome.cirad.fr/) and analyzed using the Profrep tool 
against our Musaceae repetitive sequence databases. (A) Six tracks show the numbers of similarity hits against reads from six Musaceae genomes as 
calculated by Profrep. (B) Annotation of genomic region based on our M. acuminata repeat annotation and Profrep analysis. (C) Annotation of repeats 
in the DH-Pahang genome obtained from the Banana Genome Hub. 
doi:10.1371/journal.pone.0098918.g006 
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acuminata sequences. The similarity search revealed that the large 
fraction of sequences annotated as DNA transposons and LINEs in 
the DH-Pahang assembly provide similarity hits to unclustered 
sequences or to small unannotated clusters in our data (data not 
shown). Consequendy, it seems that we have missed these sequences 
in our annotation. Some differences in repeat abundance estimates 
can be also attributed to incompleteness of assembly and biased 
composition of sequences in the genome assembly. 

Conclusions 

The present study provides a detailed insight into the 
composition and diversity of repeats in genomes representing the 
family Musaceae. Next generation sequencing with genome 
coverage greater than 10% enabled annotation and quantification 
of repeats that form 30—45% of the Musaceae genomes. The 
remaining part of the genome consists of unidentified repeats 
(~30%) and low- and single-copy sequences (26 — 45%). We show 
that there is a prevalence of Tyl/copia elements in all Musa 
genomes, with a majority of Tyl/copia elements being Maximus/ 
SIRE and Angela. Most of the elements in Ty3 /gypsy family belong 
to the lineage of chromoviruses. The present study revealed 
significant divergence in repeat composition between the species of 
Musa, and the extent of repeat divergence was related to the 
estimated divergence dates of the species in the Musaceae family. 
We also demonstrated that database of repeats derived from graph 
based clustering is well suited for annotation of genome assemblies 
and can complement other repeat annotation methods. 

Materials and Methods 

Plant material, DNA isolation and sequencing 

In vitro rooted plants of most of the Musa and Ensete species used 
in this study were obtained from the International Transit Centre 
(ITC, Katholieke Universiteit Leuven, Belgium). The clone 
'Pisang Klutuk Wulung' of M. balbisiana was obtained from 
CIRAD (Guadeloupe) as rooted plants. Plants were transferred to 
soil and maintained in a greenhouse. 

Genomic DNA of M. acuminata 'Calcutta4' (ITC 0249), M. 
balbisiana 'Pisang Klutuk Wulung', M. ornata (ITC 0637), M. textilis 
(ITC 0539) and M. beccarii (ITC 1070) was prepared from nuclei 
isolated from healthy young leaf tissue according to [48] . Intact 
nuclei of E. gilletii (ITC 1 389) were isolated by flow cytometric 
sorting following the protocol of [49]. Isolated nuclei were 
incubated with 40 mM EDTA, 0.2% SDS and 0.25 ng/u.1 
proteinase K for 5 hrs at 37°C, and nuclear DNA was purified 
by phenol/ chloroform extraction. 

Roche/454 shotgun sequencing libraries were prepared by the 
GS Titanium library preparation kit (454 Life Sciences, a Roche 
company, Branford, USA). The single-stranded libraries were 
quantified by a qPCR assay and processed utilizing the GS Titanium 
SV/LV emPCR and XLR70 sequencing kits according to the 
manufacturer's instructions (Roche Diagnostics). Sequencing was 
performed on a half 70 x 75 picotiter plate for each Musa cultivar [8] . 
Sequence reads were divided into clusters using a graph-based 
method according to [24] with the difference in that the reads were 
trimmed to 200 nucleotides and reads from all species were clustered 
together. Computational tools used for clustering step are available 
at the public server (www.repeatexplorer.org) [25] . 

Repeat cluster annotation and repeat identification 

Several resources were used to manually annotate clusters. 
Reads from clusters were scanned for similarity to a database of 
plant repetitive elements with RepeatMasker [50] using databases 
that were improved by adding specific sequences derived from the 



banana genome based on our previous work [19]. Blastx and 
blastn [5 1] were used for similarity search against public databases 
and also against our database of protein domains derived from 
plant mobile genetic elements. Clusters represented as graphs were 
also analyzed using SeqGrapheR program (http:/ /cran.rproject. 
org/ web/ packages/SeqGrapheR/index.html). 

Musaceae phylogenetic tree construction 

Internal transcribed spacers (ITS) data obtained by [19] was 
used to construct a BioNJ tree based on the Jukes-Cantor model in 
the SeaView v4.2.3 program [52]. Phylogenetic trees were drawn 
and edited using the FigTree (http://tree.bio.ed.ac.uk/software/ 
figtree/) program. 

Construction of phylogenetic tree for RT domains 

Reads with similarity to the Ty3/ gypsy reverse-transcriptase 
domain were trimmed and aligned using MAFFT software [53], 
and the maximum-likelihood phylogenetic tree was estimated 
using the FastTree program [54] . The resulting alignment and tree 
are provided in Files SI and S2. 

Southern blots 

Genomic DNA of 15 selected Musaceae representatives was 
prepared from nuclei isolated from healthy young leaf tissue. Aliquots 
of genomic DNA samples corresponding to 1 6 x 1 0 6 of nuclear genomes 
were digested using Dral, EcoKV, Rsal or Mspl restriction enzymes, 
size-fractionated by 1.2% agarose gel electrophoresis, and transferred 
onto Hybond N+ nylon membranes (Amersham Biosciences, Bath, 
UK). Biotin-labeled oligomers (file S3) were used as probes. The 
Southern hybridization was done at 68 °C overnight followed by 
stringent washes (stringency 90%). Signals were detected using the 
BrightStar BioDetect kit according to the manufacturer's instructions 
(Ambion, Austin, USA), incubated with chemiluminescent substrate 
(CDP-Star, Amersham Biosciences), and exposed to X-ray film. 

Resources 

Sequences were deposited in the Sequence Read Archive under 
accession ERX047938-ERX047944. 

Profrep with Musaceae repeat database is available at http:// 
w31amc.umbr.cas.cz/profrep/public. 

Supporting Information 

Figure SI Clustering of sequence reads. Top bar plot show 
the results of clustering based on the hierarchical agglomeration 
algorithm of all sequence reads. Each bar represents one cluster; the 
height and width of the bars correspond to the number of reads in 
the clusters. The Y-axis shows the percentage of the reads in the 
clusters and the X-axis shows their cumulative content. Bars are 
colored according to the major types of identified repeats. Only first 
106 largest clusters is shown. Six lower bar plots show the 
representation of reads in the clusters by individual species. 
(PNG) 

Figures S2 DH-Pahang genomic sequences together 
with repeat annotation were obtained from the Banana 
Genome Hub (http://banana-genome.cirad.fr/) and 
analyzed using the Profrep tool against our Musaceae 
sequence database. (A) Six tracks show the number of 
similarity hits against reads from six Musaceae genomes as 
calculated by Profrep. (B) Annotation of genomic region based on 
our M. acuminata repeat annotation and Profrep analysis. (C) 
Annotation of repeats in the DH-Pahang genome obtained from 
the Banana Genome Hub. 
(PNG) 
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Figure S3 DH-Pahang genomic sequences together with 
repeat annotation were obtained from the Banana 
Genome Hub (http://banana-genome.cirad.fr/) and 
analyzed using the Profrep tool against our Musaceae 
sequence database. (A) Six tracks show the number of 
similarity hits against reads from six Musaceae genomes as 
calculated by Profrep. (B) Annotation of genomic region based on 
our M. acuminata repeat annotation and Profrep analysis. (C) 
Annotation of repeats in the DH-Pahang genome obtained from 
the Banana Genome Hub. 
(PNG) 

Figure S4 Comparison of repetitive content in shotgun 
sequencing data and genome assemblies of M. acumi- 
nata and M. balhisiana. Each dot represents a sequence read. 
The X and Y axes show the normalized number of similarity hits 
detected in shogun data and assembly, respectively. 
(PNG) 

File SI DNA alignment of reverse transcriptase se- 
quences derived from the Ty3 1 gypsy CRM element from 
cluster CL18. 

(FAS) 
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